Journal of Bioinformatics and Systems Biology — Latest Matching Preprints

1

Binary search and set operations on compacted k-mer lists

Dufresne, Y.; Andreace, F.

2026-07-03 bioinformatics 10.64898/2026.06.29.735436 medRxiv

Top 0.1%

1.6%

Sorted lists of elements are particularly good for computing set operations. A single scan of the two lists is sufficient to materialize or count the results of the union, intersection, difference, and xor operators. In bioinformatics, only a few tools are designed to perform these operations on k-mers. A fast tool like KMC allows set operations at the cost of storing individual k-mers. In this paper, we introduce a novel way to represent sorted k-mers as a collection of recomposed super-k-mer sorted lists. We introduce the concept of virtual super-k-mer and show how to construct, query and perform set operations on sorted lists of virtual super-k-mers. In the implementation sklib, we demonstrate high throughput of the data structure for construction and set operations, while remaining competitive in query capabilities, within a controlled memory footprint (2-5x decrease in bits/element compared to KMC).
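
The single-scan idea described in this abstract can be illustrated with a minimal Python sketch. It assumes plain sorted lists of distinct k-mer strings and does not reproduce sklib's virtual super-k-mer representation; the function name `merge_set_ops` is hypothetical.

```python
def merge_set_ops(a, b):
    """Compute union, intersection, difference (a - b), and xor of two
    sorted lists of distinct items in one simultaneous scan."""
    union, inter, diff, xor = [], [], [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Present in both lists.
            union.append(a[i])
            inter.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Present only in a.
            union.append(a[i])
            diff.append(a[i])
            xor.append(a[i])
            i += 1
        else:
            # Present only in b.
            union.append(b[j])
            xor.append(b[j])
            j += 1
    # One list is exhausted; the remainder of the other is unmatched.
    for x in a[i:]:
        union.append(x)
        diff.append(x)
        xor.append(x)
    for y in b[j:]:
        union.append(y)
        xor.append(y)
    return union, inter, diff, xor
```

Each element of either list is visited once, so the cost is linear in the combined length of the two inputs.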

2

cran2crux: automatically create CRUX ports for R-packages

Petrov, P.; Izzi, V.

2026-05-13 bioinformatics 10.64898/2026.05.09.723963 medRxiv

Top 0.1%

1.3%

Motivation: R, together with CRAN and Bioconductor, provides one of the richest ecosystems for bioinformatics and computational biology, with thousands of specialized packages. While GNU/Linux is a widely used operating system in this field, R-packages are typically managed independently of the system's native package manager. This separation makes installation, updates and mass rebuilds cumbersome. CRUX, a minimalist semi-source GNU/Linux distribution, offers great flexibility with its ports-based system for the seamless integration of R-packages with its native package manager. Results: The cran2crux tool presented here automatically generates CRUX ports for packages from both CRAN and Bioconductor. It performs recursive dependency resolution, handles naming conventions, extracts dependency information, and supports inclusion of optional dependencies. The tool also provides convenient functions for checking updates and regenerating outdated ports. It can generate over 140 ports for complex packages such as Seurat in approximately 11 seconds, dramatically simplifying the maintenance of large R-dedicated repositories on CRUX. Availability: cran2crux is available under the MIT license at https://github.com/izzilab/cran2crux. As of now, more than 650 R package ports, generated with the tool, are available in the CRUX ports database.
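
Recursive dependency resolution of the kind mentioned above can be sketched in a few lines of Python. This is not cran2crux's own code: the function `resolve` and the toy dependency table are hypothetical, and real CRAN or Bioconductor metadata would need to be parsed separately.

```python
def resolve(package, deps, order=None, seen=None):
    """Return packages in build order (dependencies before dependents).

    `deps` maps a package name to the list of packages it depends on.
    """
    if order is None:
        order, seen = [], set()
    if package in seen:
        # Already handled (this also stops infinite recursion on cycles).
        return order
    seen.add(package)
    for dep in deps.get(package, []):
        resolve(dep, deps, order, seen)
    # All dependencies are in `order` by now, so the package can follow.
    order.append(package)
    return order


# Hypothetical, simplified dependency table (not real CRAN metadata).
TOY_DEPS = {
    "pkgA": ["pkgB", "pkgC"],
    "pkgB": ["pkgC"],
    "pkgC": [],
}
```

Calling `resolve("pkgA", TOY_DEPS)` lists `pkgC` first, then `pkgB`, then `pkgA`, which is the order in which ports would have to be built.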

3

Evolutionary analysis of vertebrate KCNH voltage-gated potassium channels and their expression in zebrafish embryos

Wu, K.; Wang, D.; Dong, Z.; Zhou, A. Y.; Zhang, G.

2026-05-24 developmental biology 10.64898/2026.05.21.726828 medRxiv

Top 0.1%

1.1%

Voltage-gated potassium channels (Kv) are a large family of potassium channels composed of 40 members across 12 subtypes. The KCNH genes encode 3 subfamilies of voltage-gated potassium channels: Kv10 (EAG, ether a go go), Kv11 (ERG, EAG-related gene), and Kv12 (ELK, EAG-like K). Kv channels play prominent roles in the neuronal and cardiovascular systems. Mutations in Kv channels have been linked to many human diseases, such as epilepsy, heart arrhythmias, and cancers. Significant progress has been made in understanding their protein structures, physiological functions, and pharmacological modifiers. However, the evolutionary history and gene expression of vertebrate KCNH genes during embryonic development remain largely unknown. We systematically identified and cloned 14 kcnh genes in zebrafish. Then, we examined vertebrate KCNH channel evolution by phylogenetic and syntenic analyses. Our data revealed that the three subtypes of the KCNH gene family had already evolved in invertebrates, long before the emergence of vertebrates. The number of vertebrate KCNH genes increased, most likely due to whole-genome duplications (WGDs). In addition, we examined zebrafish kcnh gene expression during early embryogenesis by in situ hybridization. Each subgroup's genes showed similar but distinct gene expression domains, with some exceptions. Most of them were expressed in neural tissues. Notably, kcnh6a showed robust expression in the developing heart, consistent with its conserved role in cardiac repolarization. Additionally, a few kcnh genes were transiently expressed in nonneural tissues, such as somites and the notochord, suggesting they may have a unique role in embryonic development. Our phylogenetic and developmental analyses of KCNH channels shed light on their evolutionary history and potential roles during embryogenesis, in line with their physiological functions and human channelopathies.

4

HydraMPP: A lightweight library for distributed massive parallel processing in Python - threading at scale.

Figueroa, J. L.; White, R. A.

2026-06-08 bioinformatics 10.64898/2026.06.04.730204 medRxiv

Top 0.2%

1.1%

We now exist in the era of massive datasets from genomics, large language models, and all the known knowledge of humanity right at our fingertips. Much of this data is becoming more accessible; however, processing such data remains an ongoing issue across systems, including high performance computing (HPC) infrastructures. Massively parallel computing (MPP) addresses this with a divide-and-conquer approach, splitting workloads across independent nodes (i.e., central processing units, CPUs) to allow higher scaling of data. The main engine for this in Python is Ray; however, it has many issues, including a large code space, security issues, debugging opacity, and memory management issues. Here, we present HydraMPP, a lightweight library that is easy to use, highly auditable, and designed with SLURM ergonomics.
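
The divide-and-conquer pattern described above can be sketched with Python's standard library. This does not use or imitate the HydraMPP API, which the abstract does not describe; threads on one machine stand in for independent nodes, and the helper names `chunk`, `gc_count`, and `parallel_gc` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal pieces."""
    size, extra = divmod(len(items), n_chunks)
    pieces, start = [], 0
    for k in range(n_chunks):
        # The first `extra` pieces take one additional item each.
        end = start + size + (1 if k < extra else 0)
        if end > start:
            pieces.append(items[start:end])
        start = end
    return pieces


def gc_count(seqs):
    """Toy per-chunk task: count G and C bases in a list of sequences."""
    return sum(s.count("G") + s.count("C") for s in seqs)


def parallel_gc(seqs, workers=4):
    """Divide the input, process each piece independently, combine results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(gc_count, chunk(seqs, workers)))
```

The same split, map, and combine structure applies when the workers are processes or cluster nodes rather than threads; only the executor changes.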

5

Semi-automated reconstruction of glomerular architecture from 3D confocal microscopy data

Loyd, Y. M.; Chase, S. E.; Krendel, M.

2026-07-10 cell biology 10.64898/2026.07.03.736410 medRxiv

Top 0.2%

1.0%

Nephrons are the functional units of the kidney; within each nephron, the glomerulus is the initial site of selective filtration that allows removal of waste products while preserving proteins in the bloodstream. Each glomerulus consists of a network of capillaries surrounded by specialized epithelial cells, podocytes, which mediate selective filtration. Abnormalities in glomerular structure impair renal function, resulting in proteinuria and kidney disease. Although several microscopy-based approaches exist to characterize glomerular architecture and structural abnormalities, quantitative analysis is often limited by labor-intensive image segmentation. In this study we present a semi-automated approach for segmentation and analysis of glomerular architecture from three-dimensional confocal microscopy data. Using mTmG transgenic mice that express membrane-associated EGFP in podocytes and membrane-associated tdTomato across all other cell types, we reconstruct podocyte processes and glomerular capillaries from volumetric renal images. This semi-automated approach reduces manual segmentation effort and supports more efficient, standardized analysis of glomerular architecture in three-dimensional confocal microscopy datasets.

6

High throughput single-cell RNA sequencing of intact adult cardiomyocytes and non-myocytes using a split-pool approach

Hu, Y.; Gurung, R.; Mueller, S.; Villanueva, E.; Stenzig, J.; Rayan, N.; Luu, T. D. A.; Nur, S.; Tan, B.; Liu, B.; Yu, H.; Choi, H.; Foo, R.; Ackers-Johnson, M. A.

2026-04-30 cell biology 10.64898/2026.04.28.721288 medRxiv

Top 0.2%

1.0%

MOTIVATION: Adult cardiomyocytes are difficult to profile by whole-cell single-cell RNA sequencing because of their large size and fragility, which make them poorly compatible with standard workflows. Current approaches for adult cardiomyocyte transcriptomics often require a trade-off between data quality and throughput; thus, studies instead rely heavily on sequencing of nuclei alone. Therefore, we set out to develop a high-quality and scalable workflow for adult heart cells using in-cell ligation and split-pool barcoding strategies to address this methodological gap. This workflow may be further generalisable to other large cell types or samples containing cell populations with highly unequal RNA content. SUMMARY: Adult cardiomyocytes are difficult to profile by whole-cell single-cell RNA sequencing (scRNA-seq). Here, we developed a high-quality and scalable workflow for adult heart cells using in-cell ligation and split-pool barcoding. We identified per-cell RNA content as a significant variable that must be accounted for. Separation of cardiomyocytes (large cells) and non-cardiomyocytes (small cells) before library construction, and allocation of deeper sequencing to cardiomyocytes, produced high-quality whole-cell datasets for both compartments. Compared with single-nucleus RNA sequencing, whole-cell cardiomyocyte profiling better recovered metabolic, mitochondrial, cytoplasmic translational, and contractile gene programs. This workflow provides a practical method for scalable, high-quality cardiomyocyte whole-cell scRNA-seq and offers general strategies for other large cell types or samples containing cell populations with highly unequal RNA content.

7

Artificial Intelligence-Based Chatbots in Genetic Counseling Practice: Current Uptake, Utilization, and Perspectives

Daley, N.; Griswold, A.; Moreno, L.; Floyd, A.; Duong, D.; Solomon, B. D.; Waikel, R. L.

2026-05-24 genetic and genomic medicine 10.64898/2026.05.21.26353789 medRxiv

Top 0.2%

0.9%

AI-driven chatbots have been utilized in healthcare to automate administrative tasks, improve patient education, and expand access to medical information; however, their role in genetic counseling remains underexplored. To investigate the adoption, perceptions, and potential utility of AI-based chatbots in genetic counseling practice, 217 genetic counselors and genetic counseling students from across North America were surveyed regarding chatbot usage, confidence in their application, and perceived benefits and limitations. While most participants (166/217; 76.5%) reported using general AI chatbots outside of clinical settings, far fewer (18/204; 8.8%) reported using or recommending clinical genetics chatbots in clinical practice. Among those who used clinical genetics chatbots, the primary purposes were communication with at-risk family members (11/18; 61.1%) and patient education (10/18; 55.6%). Confidence in chatbot technology varied, with highest confidence in gathering family history information (81/199; 40.7%) and lowest confidence in their ability to disclose variants of uncertain significance or positive genetic testing results (5/199; 2.5%). The greatest perceived benefits included reducing repetitive tasks (165/195; 84.6%) and allowing time for other tasks (141/195; 72.3%), while major concerns revolved around patient comprehension (167/195; 85.6%) and having accurate, up-to-date information (145/195; 74.4%). Despite some concern about AI replacing human counselors, most participants reported they felt there was potential for chatbots to enhance workflow efficiency (128/195; 65.6%) if properly integrated and regulated. Limited AI training was identified as a barrier to adoption (16/195; 8.2% received training), highlighting a need for structured education on AI applications in genetic counseling.
These findings suggest that AI chatbots hold promise as supplementary tools, but significant challenges must be addressed before widespread implementation in genetic counseling practice.

8

Simple Electroporation of Chlamydomonas reinhardtii Strains with an Intact Cell Wall

Messmer, M.; de Carpentier, F.; Lam, E.; Hong, M.; Wakao, S.; Schroda, M.; Niyogi, K. K.

2026-05-05 molecular biology 10.64898/2026.04.30.721989 medRxiv

Top 0.2%

0.9%

Chlamydomonas reinhardtii is a model green alga extensively used to study photosynthesis and cilia using molecular biology and genetics. Electroporation is a very common technique to transform DNA into the nuclear genome, which is essential to generate mutant collections and express transgenes. Here, we describe a simple, fast, and efficient protocol to transform strains with an intact cell wall. It achieves a good transformation efficiency without cell wall digestion or use of commercial kits and is compatible with the widely available Gene Pulser electroporation system.
Key features:
- High transformation efficiency of Chlamydomonas reinhardtii strains with an intact cell wall.
- Faster than currently available electroporation protocols.

9

SaVanache: indexing and visualizing pangenome variation graphs

Mohamed, M.; Durant, E.; Rouard, M.; Muller, C.; Monat, C.; Conte, M.; Sabot, F.

2026-05-08 bioinformatics 10.64898/2026.05.05.722901 medRxiv

Top 0.3%

0.8%

With the rapid increase in genome sequencing and the growing availability of genomic resources, genomics is shifting toward pangenome representations that capture intra- and inter-specific diversity by integrating multiple genomes into a single entity. These pangenomes are increasingly modeled as graphs, encoding complex genomic variations in structures such as de Bruijn or variation graphs. However, while genome browsers provide standard and effective solutions for visualizing single or limited numbers of genomes, equivalent interactive tools for graph-based pangenomes remain limited, particularly for variation graph models. We developed SaVanache, a multi-resolution visualization interface designed to explore pangenome variation graphs at various depths. SaVanache enables the exploration of both global diversity and structural variations (SVs) across genomes relative to a user-defined linear pivot genome. Unlike synteny viewers, SaVanache emphasizes variations by representing SV types through a dedicated set of glyphs, facilitating intuitive one-to-many comparisons. To support smooth exploration, SaVanache preprocesses a Graphical Fragment Assembly (GFA) pangenome file into optimized index and data structures, enabling fast, real-time queries on large pangenome graphs. By combining advanced visualization techniques with efficient data handling, SaVanache provides a robust tool for scientists to analyze and visualize genetic variation within genomes and pangenomes, facilitating the identification of genetic determinants associated with phenotypes of interest and fully exploiting current genomic resources. Author summary: We introduce SaVanache, an innovative tool that transforms the way we explore genomic resources. SaVanache allows visualization and analysis of pangenome variation graphs (PVGs), which capture genomic diversity by integrating structural variants (SVs) and single nucleotide polymorphisms (SNPs) across multiple genomes.
Unlike traditional genome browsers limited to a few genomes, SaVanache offers a multi-level, user-friendly interface that allows users to explore from whole pangenomes down to individual structural variants, enabling multidimensional research and development. Using a linear pivot genome as a visual reference, SaVanache simplifies complex PVG structures into intuitive comparisons. It efficiently handles large datasets and speeds up data retrieval through internal parsing. The front-end, built with modern JavaScript frameworks, provides interactive and responsive visualization, while the Python/Django backend supports real-time data updates. Users can detect and classify SVs by comparing syntenic segments between genomes, visualized through a novel glyph-based system that uses shapes and colors to represent complex rearrangements. SaVanache supports seamless zooming from chromosome-wide to nucleotide-level views, interactive diversity scatterplots, dynamic pivot genome switching, and grouping genomes by metadata to explore genotype-phenotype links. In addition, export functions bridge visualization with downstream bioinformatics. Developed with user feedback, SaVanache balances biological relevance and computational efficiency, overcoming PVG complexity to empower users with unprecedented insight into genomic diversity and SVs.

10

Programmatic access to ICTV virus taxonomy through a public ontology API

Lieutaud, P.; McLaughlin, J.; Hendrickson, R. C.; David, R.; Parkinson, H.; Lefkowitz, E.; Dempsey, D.; Coutard, B.

2026-06-16 bioinformatics 10.64898/2026.06.16.732600 medRxiv

Top 0.3%

0.8%

The International Committee on Taxonomy of Viruses (ICTV) is responsible for developing and maintaining a universal virus taxonomy. As the reference framework for organising the viral world, it is essential for virology and related fields. Despite its widespread use in research and public health, programmatic access to ICTV taxonomy has remained limited, posing challenges for integration, versioning, and interoperability across databases and bioinformatics resources requiring up-to-date virus taxonomy. To address this, we developed a public and sustainable solution leveraging ontology-based APIs. Successive ICTV Master Species List (MSL) releases were transformed into a structured ontology and deployed as a unified representation through the Ontology Lookup Service (OLS). The framework also provides ICTV-NCBI mappings and helper libraries for integration into downstream systems. This enables, for the first time, public programmatic retrieval of current and historical virological taxon names, taxonomic relationships, metadata, and persistent identifiers through stable endpoints. More broadly, this work illustrates a general strategy for transforming structured biological datasets into semantically enriched graph resources exposed through scalable public APIs. These developments enhance interoperability, reduce manual curation, and support FAIR-aligned taxonomic data management in virology and pandemic preparedness.
Key points:
- ICTV provides the official taxonomy for classifying viruses and naming virus taxa, but lacks standardised programmatic access.
- Transforming ICTV data into an ontology enables semantic, machine-actionable access across releases via ontology-based APIs.
- ICTV-NCBI mappings support interoperability across bioinformatics resources.
- The framework enables programmatic resolution of current and historical viral taxa.
- This approach provides a reusable model for exposing biological datasets through public APIs.

11

VX: an AI-enabled desktop genome viewer and transcriptome browser with a programmable analysis framework

Shirokikh, N. E.; Cleynen, A.

2026-05-20 bioinformatics 10.64898/2026.05.17.725790 medRxiv

Top 0.3%

0.8%

Background: Genome and transcriptome browsers are central to the interpretation of high-throughput sequencing data, but today's tools assume a human operator at a graphical interface and offer only limited programmability. As large-language-model assistants become routine in bioinformatics [Anthropic, 2024], this creates a bottleneck: agents cannot observe the visual state of the browser or drive it through the same interface as the human user, and analyses remain fragmented across a separate ecosystem of external tools. Transcript-coordinate data, produced by ribosome profiling [Ingolia et al., 2012] and direct RNA sequencing [Garalde et al., 2018], is also awkwardly supported in chromosome-oriented viewers. Results: We present VX, a desktop genome and transcriptome viewer written in D, using GTK 3 and OpenGL, that handles genome-scale and transcriptome-scale data in a unified interface. VX exposes its full functionality through an embedded HTTP API on the loopback interface and a Model Context Protocol server of currently thirty-nine tools, so that scripts and LLM agents can load data, navigate, manage tracks, run analyses, and capture figures through the same contract used by the GUI. An integrated analysis framework provides more than fifty analyses and includes signal processing and peak calling, quantification, variant analysis, alignment statistics, interaction and cross-track comparisons, all with an explicit four-level scope hierarchy running from viewport to whole dataset; results are written to disk and, where appropriate, added as new tracks. Additional features include a magnifier popup for base-resolution inspection (Alt+hover), chromosome-alias resolution across UCSC, Ensembl, and NCBI conventions, viewport video recording via an ffmpeg pipe, and INI-based configuration. Conclusions: VX complements existing desktop and web browsers by providing a native agent-control layer, an integrated analysis framework, and first-class transcript-space handling.
The binary is freely available for non-commercial use; the HTTP API and MCP protocol are fully specified in this article, so third-party clients can be written independently of the core implementation.
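
Chromosome-alias resolution across the UCSC, Ensembl, and NCBI conventions, mentioned in this abstract, amounts to a lookup between equivalent names. The sketch below is written in Python for brevity (VX itself is written in D) and says nothing about how VX implements it; the table holds three human GRCh38 examples, and `ALIASES`, `LOOKUP`, and `resolve_alias` are hypothetical names.

```python
# Each row lists one sequence under three conventions:
# (UCSC, Ensembl, NCBI RefSeq accession). Human GRCh38 examples only.
ALIASES = [
    ("chr1", "1", "NC_000001.11"),
    ("chrX", "X", "NC_000023.11"),
    ("chrM", "MT", "NC_012920.1"),
]

# Index every known name to its full row so any convention can be the input.
LOOKUP = {name: row for row in ALIASES for name in row}

COLUMN = {"ucsc": 0, "ensembl": 1, "ncbi": 2}


def resolve_alias(name, convention):
    """Translate a sequence name into the requested naming convention."""
    return LOOKUP[name][COLUMN[convention]]
```

With this table, `resolve_alias("MT", "ucsc")` gives `"chrM"`; an unknown name raises `KeyError`, which a real viewer would presumably handle more gracefully.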

12

A portable molecular laboratory for rapid genotyping in the field: application to sickle cell disease

Grunder, F.; Haemmerli, A.-F.; Bokembya, C. I. N.; Hennart, S.; Helmers, M.; Porret, N. A.; Graz, B.; Choudja Ouabo, C.; Abriel, H.

2026-05-12 genetic and genomic medicine 10.64898/2026.05.05.26352080 medRxiv

Top 0.3%

0.6%

Background: Sickle cell disease (SCD) is the most common recessive genetic disorder, caused by pathogenic variants of the HBB gene. SCD is associated with a range of clinical manifestations, including vaso-occlusive crises, infections, and severe anaemia, which contribute to increased morbidity and mortality. The frequency of pathogenic alleles is high in Sub-Saharan African countries, with heterozygous carriers reaching up to 25% of the population. Several methods can be employed for molecular diagnostics, with HBB gene sequencing being the most precise. However, access to DNA analyses and sequencing in Low- and Middle-Income Countries (LMICs), where SCD prevalence is high, is limited. Understanding genetic profiles is crucial at both individual and population levels, as it can guide public health strategies and facilitate accurate genetic counselling. Aim: This feasibility study aimed to demonstrate that a portable medical genetic laboratory (in suitcases) can be used to genotype individuals for the HBB A, S, and C alleles and their combinations within a few hours outside of a laboratory setting. Methods and results: We established a portable medical genetics laboratory capable of DNA extraction and isothermal DNA amplification using a commercially available kit for the A, S, and C alleles of the HBB gene. During one single study day, this portable lab was set up in a room where the Swiss Association of Patients with SCD was holding its annual meeting. We analysed the samples of 27 participants who were aware of their A, S, or C status. We collected buccal swabs and dried blood samples for genotyping. Genotype results for all participants were obtained within five hours after sample collection. In four cases, we observed discrepancies between the buccal swab and blood genotypes; three were resolved upon repeat testing, and one reflected donor chimerism following hematopoietic stem-cell transplantation.
Conclusions: This study demonstrates the feasibility and efficiency of using a portable medical genetics laboratory for rapid genotyping of HBB SCD alleles in community settings. This approach can improve access to molecular diagnostics in resource-limited environments. Such tools have the potential to significantly enhance local capabilities for genetic screening, counselling, and public health planning in regions heavily affected by SCD.

13

Assessing the efficacy of human mesenchymal stromal cells of different tissue origins in a mouse model of kidney ischaemia reperfusion injury

Trivino-Cepeda, K.; Amadeo, F.; Hughes, D. M.; Ressel, L.; Garcia-Finana, M.; Hanson, V.; Taylor, A.; Murray, P. A.; Wilm, B.

2026-06-20 physiology 10.64898/2026.06.19.733188 medRxiv

Top 0.3%

0.6%

Rodent models of kidney disease have been widely used to assess the efficacy, safety and mode of action of mesenchymal stromal cells (MSCs) as therapies. However, because kidney disease models, MSC type and the methods used to assess kidney injury tend to differ between research groups, it is difficult to obtain data that are sufficiently robust and reproducible to support clinical translation. We present here for the first time a side-by-side analysis of the performance of human MSCs derived from the most commonly used tissue sources, bone marrow (BM-), adipose (A-) and umbilical cord (UC-), in a kidney ischaemia reperfusion injury (IRI) model in mice. For each animal, we performed a comprehensive assessment of kidney function and health by longitudinal transdermal measurements of sinistrin clearance, serum biomarker levels at the experimental endpoint, and histopathological scoring of sections from left and right kidneys. Furthermore, we tracked the MSCs by bioluminescence imaging in the injured mice to determine their viability over time and their capacity for homing to the damaged kidneys. Our results reveal that only modest, if any, beneficial effects of the MSC treatments were detectable on kidney function and histology, irrespective of cell type administered. Furthermore, all three MSC types were sequestered in the lungs without reaching the kidneys, and had completely disappeared within 7 days. Our data suggest that none of the MSC types has the capability to improve renal health following IRI to a meaningful extent, questioning their suitability as a clinical therapy. Significance Statement: MSCs have been proposed as efficacious cell therapies in murine models of kidney disease, with potential for clinical translation. We compare efficacy of human MSCs of different tissue origins (adipose, bone marrow and umbilical cord) in a refined mouse model of renal IRI.
Only modest if any beneficial effects on kidney function and histology were detectable for all three cell types, and cells did not reach the kidneys but sequestered in the lungs where they died.

14

Drivers of Diagnostic Variation in a Digital Global Kidney Transplant Reader Study

Hofstraat-Boersma, R.; du Long, R.; Buzzanca, G.; Abiola, A. A.; Albadri, S.; Ali, Z.; Altaleb, A.; Angioi, A.; Banu, S. G.; Barry, M.; Bhalodia, A. R.; Bianco, P.; Broecker, V.; Buelow, R.; Chauveau, B.; Chen, G.; Cheunsuchon, B.; Crisi, G. M.; Daneshvar, S.; Dendooven, A.; Dokouhaki, P.; Drachenberg, C. B.; Farris, A. B.; Ferlicot, S.; Florquin, S.; Fontana, F.; Gibier, J.-B.; Gibson, I. W.; Gujarathi, S.; Hendricks, A. R.; Husain, S.; Islam, J.; Ismail, W.; Jagannathan, G.; Klager, J.; Kozakowski, N.; Krizova, A.; Kurien, A. A.; Kwon, B.; L'Imperio, V.; Ledesma, F. L.; Low, J. P.; Martin, J

2026-07-13 pathology 10.64898/2026.07.09.26357318 medRxiv

Top 0.4%

0.6%

Background: Diagnostic interpretation of kidney allograft biopsies using the Banff classification remains variable, but the determinants of this variability are not fully defined. We performed a global, fully digital multi-reader study to identify the principal drivers of disagreement in Banff-based assessment. Methods: Thirty-six kidney transplant biopsies were independently scored by 67 renal pathologists on a standardized digital platform. Readers assessed Banff lesions on hematoxylin and eosin, periodic acid Schiff, and Jones' silver stains; final diagnostic categories were assigned using prespecified Banff-based decision rules. Interobserver agreement was quantified with Gwet's agreement coefficient (AC) statistics. Determinants of diagnostic agreement were evaluated using pairwise mixed-effects logistic regression, and reader similarity was examined by principal component analysis (PCA) with post hoc molecular annotation. Results: Agreement for final diagnostic categories was moderate (Gwet's AC1, 0.55; 95% CI, 0.47-0.63). Lesion-level agreement varied substantially, with lowest agreement for selected threshold-dependent inflammatory or semi-quantitative lesions, including interstitial inflammation in areas of IFTA, peritubular capillaritis and arteriolar hyalinosis. Diagnostic concordance differed markedly across biopsies, indicating strong case-level heterogeneity. In pairwise models, differences in active inflammatory and vascular lesion scoring were the strongest correlates of diagnostic disagreement; reader experience and geography contributed minimally. Principal component analysis showed reader variation was organized along two dominant axes: a rejection-calling threshold axis linked mainly to tubulointerstitial inflammatory injury, and a T cell-mediated (TCMR/TI) and antibody-mediated/microvascular (AMR/MVI) inflammation-oriented phenotypic classification axis.
Conclusion: Interobserver variation in Banff-based kidney transplant biopsy assessment is structured rather than random and driven mainly by how readers threshold and integrate key inflammatory lesion compartments rather than experience or geographic location.
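
For readers unfamiliar with the statistic reported above, the following Python sketch computes Gwet's AC1 for nominal categories from its commonly cited multi-rater definition: observed agreement p_a is the mean pairwise agreement per subject, chance agreement is p_e = (1/(Q-1)) Σ_q π_q(1-π_q), and AC1 = (p_a - p_e)/(1 - p_e). The abstract does not state the study's exact procedure (for example weighting or missing-rating handling), so this is an illustration only; the function name `gwet_ac1` and the input layout are assumptions.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for nominal ratings.

    ratings: one list of category labels per subject, one label per rater
             (every subject must have at least two ratings).
    categories: list of all possible category labels (at least two).
    """
    q = len(categories)
    n = len(ratings)
    pa_sum = 0.0
    pi_sum = {c: 0.0 for c in categories}
    for labels in ratings:
        r = len(labels)
        counts = Counter(labels)
        # Fraction of rater pairs that agree on this subject.
        pa_sum += sum(k * (k - 1) for k in counts.values()) / (r * (r - 1))
        # Share of raters choosing each category on this subject.
        for c, k in counts.items():
            pi_sum[c] += k / r
    pa = pa_sum / n
    # Chance agreement from the average category prevalences.
    pe = sum((s / n) * (1 - s / n) for s in pi_sum.values()) / (q - 1)
    return (pa - pe) / (1 - pe)
```

A value of 1 indicates perfect agreement, and values near 0 indicate agreement no better than this chance model predicts.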

15

KBase Research Agent: Automated Multi-Agent Workflow Construction for Reproducible Genome Analysis

Gupta, P.; Riehl, W. J.; Cashman, M.; Chivian, D.; Neely, C. J.; Canon, S. R.; Cottingham, R.; Henry, C.; Arkin, A. P.; Dehal, P. S.

2026-06-04 bioinformatics 10.64898/2026.06.01.729336 medRxiv

Top 0.4%

0.6%

Constructing multi-step bioinformatics workflows, from read quality control through genome assembly to functional annotation, requires expertise in both biology and computational tool selection, creating a bottleneck for scalable and reproducible analysis. We present the KBase Research Agent, a multi-agent system for automating such workflows within the DOE Systems Biology Knowledgebase (KBase). Given a set of sequencing reads and a research objective, the agent constructs an analysis plan grounded in KBase documentation and a Knowledge Graph (KG) of the KBase application catalog, then selects, parameterizes, validates and executes appropriate KBase applications to carry out the workflow. The resulting analysis is preserved as a reproducible KBase Narrative. We evaluate the system's planning and execution quality against ground truth constructed from reference workflows derived from peer-reviewed Microbiology Resource Announcements. We further apply the agent to 100 previously unanalyzed bacterial isolate genomes from the JGI IMG/M database, where it autonomously performed read quality control, genome assembly, taxonomic classification with GTDB-Tk, and downstream analysis producing annotated genomes, reproducible Narratives, and draft manuscripts without human intervention. Across these experiments, the KBase Research Agent demonstrates the feasibility of domain-grounded, end-to-end scientific workflow automation in a production bioinformatics platform.

16

ScriptManager: a platform for scalable and reproducible high-resolution analysis of genomics datasets

Lang, O. W.; Beer, B.; Zhang, D.; LeSon, C.; Deen, A.; Pugh, F.; Lai, W. K.

2026-06-18 bioinformatics 10.64898/2026.06.14.732163 medRxiv

Top 0.4%

0.6%

Background: The growing diversity of genomic and epigenomic assays has driven a parallel expansion in data formats, analysis workflows, and figure-generation tools. However, tools for analyzing data and assembling publication-quality figures are often specialized to a specific assay, dramatically limiting their interoperability and reproducibility. Results: We present the v1.0 release of ScriptManager, a Java-based framework for modular and reproducible analysis and visualization workflows of genomics and epigenomics data. Unlike existing tools specialized for individual assay types, ScriptManager provides a unified and extensible framework for cross-assay visualization and workflow reproducibility. The v1.0 release adds novel analytical modules, GUI session logging, automated unit and integration testing, tutorials, and expanded documentation. It also integrates with the broader reproducibility ecosystem through Singularity containers, Anaconda packaging, and Galaxy XML wrappers. We demonstrate ScriptManager's TagPileup scaling from local single-core execution to a 10,305-job analysis distributed across the Open Science Grid (OSG), with the full workload completing in <2 hours of wall-clock time. Conclusions: ScriptManager v1.0 enhances workflow portability, transparency, and reproducibility across a diverse range of high-resolution genomic assays. By coupling a flexible module design with modern reproducibility standards, ScriptManager provides a bridge between exploratory data analysis and formal, publication-ready figure generation. These improvements enable researchers to build, share, and reproduce genomic analyses across diverse computational infrastructures with minimal configuration.

17

Application of Computer Vision Tools to Maize Genomic Data for Trait Prediction and Gene Discovery

Higgins, S. A.; Anible, E.; Muthupari, M.; Dibble, C.; Murdoch, R. W.

2026-05-26 bioinformatics 10.64898/2026.05.21.726890 medRxiv

Top 0.4%

0.6%

Artificial intelligence and machine learning for computer vision (CV) and image recognition is a rapidly evolving field with multiple potential applications in plant genomics. While CV has been widely adopted by the research community for plant phenotyping and disease surveillance, applications of CV tools to plant genome analysis are underrepresented. CV tools may complement traditional statistical classification tools used in plant genomics, since CV perceives problems holistically rather than granularly (in terms of pattern recognition), which is particularly applicable to analysis of large, complex eukaryotic genomes. In this study, we report on a new strategy to apply existing CV tools to classify plant genotypes and predict genotype-phenotype relationships. A technique was developed for converting maize genome resequencing data into a set of images reminiscent of a quick response (QR) code. Several hundred maize genomes were processed and it was demonstrated that CV models can successfully categorize genome images into heterotic groups (accuracy and recall > 0.8). Models for classifying genome images into phenotypic trait groups (such as short, medium, and high plant height) performed with moderate success for the most heritable trait analyzed (ear height; accuracy and recall > 0.5). Querying model results permitted identification of genome regions that were important for model classification predictions. The CV model results revealed enriched metabolic pathways consistent with traits under consideration. Overall, our initial application of CV tools to plant genome analysis highlights its applicability to genomic data. Design of new CV architectures optimized for genome-derived images may further improve upon our initial results generated using only off-the-shelf CV tools optimized for unrelated image analysis tasks. 
Core ideas:
- AI/ML computer vision (CV) tools were applied to encoded maize genomes
- CV image classification tools were able to successfully classify encoded genomes into heterotic groups
- Trait values of maize strain ear height could be predicted with moderate success
- Genome regions encoding plausible metabolic pathways used by the classifier were identified
- Recommendations for improved success of CV for genotype-to-phenotype are discussed
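
The abstract does not say how resequencing data were turned into QR-code-like images, so the following is only a minimal sketch of one possible encoding, not the authors' method. It assumes genotype calls are already coded as small integers (0, 1, 2), and the function name `genotypes_to_image` and the row-major, zero-padded layout are hypothetical choices made for illustration.

```python
import math


def genotypes_to_image(calls, levels=3):
    """Pack genotype calls into a square grid of grayscale pixel values.

    Hypothetical encoding, not taken from the preprint: each call is an
    integer in 0..levels-1, written row by row into the smallest square
    that holds all calls, with unused trailing pixels set to 0.
    """
    if levels < 2:
        raise ValueError("levels must be at least 2")
    calls = list(calls)
    if any(c < 0 or c >= levels for c in calls):
        raise ValueError("each call must be in 0..levels-1")

    # Smallest side length whose square can hold every call.
    side = math.isqrt(len(calls))
    if side * side < len(calls):
        side += 1

    # Spread the call values over the 0-255 grayscale range.
    scale = 255 // (levels - 1)
    padded = calls + [0] * (side * side - len(calls))
    return [
        [padded[row * side + col] * scale for col in range(side)]
        for row in range(side)
    ]


image = genotypes_to_image([0, 1, 2, 2, 1])
for row in image:
    print(row)
```

A grid like this could be saved as an image and passed to an off-the-shelf classifier. Whether neighbouring pixels should correspond to neighbouring genome positions is a design choice the abstract leaves open.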

18

Reproducible and shareable bioinformatics pipelines from natural-language prompts

Kim, H.-M.; Jeong, H.; Mekonnen, A. M.; Kim, Y.; Oh, Y.; Lee, H.; Jung, C.; Park, J.

2026-06-01 bioinformatics 10.64898/2026.05.28.719125 medRxiv

Top 0.4%

0.6%

Large language models (LLMs) are increasingly used to generate bioinformatics pipelines and to carry out analyses from natural-language prompts. However, the resulting analyses are often difficult to reproduce across sessions, owing to the non-deterministic nature of LLM-driven conversations and the heterogeneity of local execution environments, and they cannot run on remote high-performance computing (HPC) servers or be shared and reused. We present Autopipe, a platform that guides any Model Context Protocol (MCP)-compatible LLM to produce, execute, and publish source-preserved, re-executable containerized pipelines. Autopipe enables users to execute bioinformatics pipelines on any on-premises remote server (supported by comprehensive setup documentation aimed at researchers without prior server-administration experience) and to visualize results through an extensible web-based viewer. The Autopipe platform comprises four components: a desktop application with an embedded MCP server for pipeline management and remote execution, an online registry for pipeline and plugin discovery, a web-based result viewer, and a CLI tool for customizing viewer plugins. Autopipe turns conversational analysis into re-executable and shareable workflows. Autopipe is freely available at https://autopipe.org/.

19

SPACKLE: A spatial-first framework for multi-layer spatial transcriptomic analysis

Maynard, T. M.

2026-05-29 bioinformatics 10.64898/2026.05.26.727917 medRxiv

Top 0.4%

0.6%

Background: The emergence of accessible spatial transcriptomic platforms such as 10x Genomics Visium HD and Xenium has created demand for analysis tools that can handle the complexity and scale of spatial datasets. Current frameworks approach spatial data primarily as an extension of single-cell RNA-seq pipelines, where spatial coordinates are retained as metadata rather than treated as a first-class organizing principle. As a result, common tasks such as multi-modal data alignment, region-of-interest selection, and cross-resolution visualization require manually managing disparate data types, coordinates, and scales, making spatial analysis unnecessarily time-consuming and error-prone. Results: We present SPACKLE (Spatial Platform for Analysis of Composite stacKs and Layered data Extraction), a Python-based "spatial-first" framework that treats absolute physical micron coordinates as the organizing principle for all data types. All data (morphology images, transcript point clouds, expression matrices, segmented cells, and user-defined regions) are stored as typed objects ("Channels") that carry their own spatial metadata, keeping all layers in automatic registration regardless of platform, resolution, or analysis operation. Two complementary interfaces simplify access to underlying data: the ViewPort, a compositing engine for efficient multi-channel visualization, and the DataPort, which extracts raw data in its native format for downstream analysis. A set of spatial analysis tools demonstrates the practical benefits of the framework, including ROI-based expression binning, cortical unfolding, and sub-micron fine alignment of transcript and image data. The use of modern Python data management methods helps maintain the efficiency of the framework, allowing for quick visualizations and analysis with a low memory footprint.
Conclusions: SPACKLE is designed to complement rather than replace widely used tools in the spatial analysis ecosystem (Scanpy, Squidpy, CellPose, StarDist), by handling the spatial mechanics of large datasets so that the analyst can focus on the biology. SPACKLE is freely available under the MIT license at https://github.com/maynardt/spackle.
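
To illustrate the idea of layers that carry their own spatial metadata, here is a minimal sketch of how two layers at different resolutions could stay registered through shared micron coordinates. It is not SPACKLE's API: the `Channel` class below, its fields `origin_um` and `um_per_px`, and both methods are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Channel:
    """Hypothetical layer that knows where it sits in physical space."""

    name: str
    origin_um: Tuple[float, float]  # micron position of pixel (0, 0)
    um_per_px: float                # physical size of one pixel

    def to_microns(self, col: float, row: float) -> Tuple[float, float]:
        """Convert a pixel index in this layer to absolute microns."""
        return (
            self.origin_um[0] + col * self.um_per_px,
            self.origin_um[1] + row * self.um_per_px,
        )

    def to_pixels(self, x_um: float, y_um: float) -> Tuple[float, float]:
        """Convert absolute microns to a (possibly fractional) pixel index."""
        return (
            (x_um - self.origin_um[0]) / self.um_per_px,
            (y_um - self.origin_um[1]) / self.um_per_px,
        )


# Two layers with different origins and resolutions (made-up values).
morphology = Channel("morphology", origin_um=(0.0, 0.0), um_per_px=0.5)
bins = Channel("expression_bins", origin_um=(100.0, 50.0), um_per_px=2.0)

# A bin index maps to microns, then to the matching morphology pixel.
x_um, y_um = bins.to_microns(10, 5)
print((x_um, y_um), morphology.to_pixels(x_um, y_um))
```

Because every conversion goes through microns, neither layer needs to know the other's resolution or offset, which is the property the abstract describes as automatic registration.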

20

Transcriptomic response to histone deacetylase inhibitors in cultured feline cells

Tanaka, G.; Nakamura, S.; Goto, R.; Kubota, A.; Sakamoto, N.; Awazu, A.

2026-06-10 cell biology 10.64898/2026.06.08.731028 medRxiv

Top 0.5%

0.5%

Objective: In recent years, the number of cats kept as companion animals has increased, leading to a growing demand for veterinary care. Although some histone deacetylase (HDAC) inhibitors are promising for the treatment of human cancers and neurological diseases, systematic research on HDAC inhibitors in domestic cats remains insufficient. Therefore, this study aimed to investigate the effects of HDAC inhibitors on the transcriptome of feline cells. Methods: Two types of cells derived from domestic cats, Crandell-Rees Feline Kidney (CRFK; kidney-derived) cells and PG-4 (astrocyte-derived) cells, were treated with four HDAC inhibitors (panobinostat, trichostatin A, valproic acid, and vorinostat) for 24 h. Transcriptomic changes after treatment were examined using RNA sequencing. Results: HDAC inhibitor treatment upregulated the expression of genes related to intercellular chemical interactions and signal transduction, similar to observations in human cells. Although HDAC inhibitors did not suppress the expression of cell cycle-related genes in CRFK cells, as observed in human cells, the inhibitors downregulated the expression of organogenesis-related genes. Consistent with observations in human cells, HDAC inhibitors suppressed the expression of cell cycle- and cancer-related genes in PG-4 cells. Importantly, valproic acid, which is thought to be more effective for neurological diseases than for cancer, suppressed the expression of more cancer-related genes in PG-4 cells than the other three HDAC inhibitors. Conclusion and relevance: Our findings revealed that the responses of cells derived from feline organs to various HDAC inhibitors varied considerably depending on the organ of origin and species.
Since few studies, including human studies, have comprehensively compared transcriptomic responses to multiple HDAC inhibitor classes across multiple cell types, the results of this study provide a foundation for future research on the treatment and prevention of cancer and neurological diseases in domestic cats and other mammals.